A LARGE DEVIATION INEQUALITY FOR VECTOR FUNCTIONS 
ON FINITE REVERSIBLE MARKOV CHAINS 

By Vladislav Kargin 

Courant Institute of Mathematical Sciences 

Let $S_N$ be the sum of vector-valued functions defined on a finite 
Markov chain. An analogue of the Bernstein-Hoeffding inequality 
is derived for the probability of large deviations of $S_N$; it relates 
this probability to the spectral gap of the Markov chain. Examples 
suggest that this inequality is better than alternative inequalities if 
the chain has a sufficiently large spectral gap and the function is 
high-dimensional. 

1. Introduction. Suppose that a system evolves according to a Markov 
chain and that properties of the system are described by a vector-valued 
function $f$. After a sufficiently long time, the average of the realized values 
of $f$ converges to its expected value. In many practical situations, it is of 
great interest to determine how long it takes for the average to converge 
within specified bounds. In other words, we are interested in estimating 
the probability of a large deviation of the average from its expected value. 
Large deviation theory gives the asymptotic rate of convergence but is silent 
about explicit bounds. In the case of a scalar function, the first explicit 
estimate of the probability of a large deviation was given by Gillman [9] 
and was later improved by Dinwoodie [6] and Lezaud [14]. For vector-valued 
functions, we could proceed by applying one-dimensional estimates to each 
component of the function. If $S_N$ is a vector with $m$ components and we 
want to estimate $\Pr\{|S_N| \ge \varepsilon N\}$, then it is enough to estimate 
$\Pr\{|S_N^i| \ge \varepsilon N/\sqrt{m}\}$, where $S_N^i$ is the $i$th component of the vector sum 
$S_N$. Since one-dimensional inequalities have the form 
$\Pr\{|S_N^i| \ge \eta N\} \le C\exp(-a\eta^2 N)$, our estimate will be 

    $\Pr\{|S_N| \ge \varepsilon N\} \le Cm\exp(-(a/m)\varepsilon^2 N),$ 

which has an exponential rate inversely related to $m$. It turns out that it is 
possible to improve on this inequality by deriving a genuine multidimensional 
inequality in which the rate function is dimension-free. 

To fix notation, let $\mathbb{S}$ be the state space of a finite Markov chain with 
transition matrix $P$ and invariant distribution $\mu$. We will assume that the 
chain is reversible, that is, that $\mu_s P_{st} = \mu_t P_{ts}$ for any $s$ and $t$ from $\mathbb{S}$. The 
transition matrix of a reversible chain is similar to a symmetric matrix (i.e., 
there exists a $D$ such that $D^{-1}PD$ is symmetric) and therefore enjoys many 
good properties of symmetric matrices. In particular, its eigenvalues are real. 
Let us denote the eigenvalues of $P$ as $\lambda_i$, where 

    $\lambda_0 = 1 > \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{|\mathbb{S}|-1} \ge -1.$ 

The difference $1 - \lambda_1$ is called the spectral gap of the chain. In our study, it 
will be the main indicator of how well the chain mixes the states. Finally, let 
$f$ be a function on $\mathbb{S}$ that takes values in an $m$-dimensional real Euclidean 
space, that is, in a vector space endowed with a scalar product $\langle\cdot,\cdot\rangle$ and 
the corresponding norm $|\cdot|$. We study the behavior of partial sums 
$S_N = \sum_{t=1}^{N} f(s_t)$, where the sequence $s_1, \ldots, s_N$ is a realization of the Markov 
chain evolution. 

The behavior of the sum depends on the interaction of properties of the 
function and the Markov chain. We will use two parameters that characterize 
this interaction. We call them the $l^\infty$-norm and the principal variance of 
$f$. The $l^\infty$-norm is defined as $\|f\|_\infty =: \sup_s |f(s)|$. The principal variance is 
defined as follows. With each vector $u$, we can associate the variance of the 
random scalar product $\langle f(s), u\rangle$. The randomness comes from $s$, which is 
drawn according to the invariant distribution. The principal variance of $f$ 
is defined as the supremum of these variances over all unit vectors $u$: 

    $\sigma^2(f) =: \sup_{|u|=1} \sum_{s\in\mathbb{S}} \mu_s \langle f(s), u\rangle^2 = \sup_{|u|=1} \mathbb{E}[\langle f(s), u\rangle^2].$ 

(In what follows, we will always use the symbols $\mathbb{E}$ and $\mathbb{E}^{(0)}$ to denote the 
expectation values relative to the invariant and initial distributions on $\mathbb{S}$, 
respectively.) The principal variance measures the variation of the function $f$ 
in the long run, when the distribution of $f(s)$ is approximately invariant. The 
$l^\infty$-norm helps us to determine if the function has an outlier. Directly from 
the definitions, it is clear that $\sigma^2(f) \le \|f\|_\infty^2$. 
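
As a small numerical illustration (not part of the original argument), the following Python sketch evaluates both parameters for a toy example. It uses the fact that the supremum over unit vectors $u$ of $\sum_s \mu_s \langle f(s), u\rangle^2$ is the largest eigenvalue of the matrix $\sum_s \mu_s f(s) f(s)^T$. The function names, the four-state example and the use of power iteration are illustrative assumptions, not taken from the paper.

```python
import math

def principal_variance(mu, f, iters=500):
    """Approximate sup over unit u of sum_s mu[s] * <f(s), u>^2.

    This equals the largest eigenvalue of M = sum_s mu[s] f(s) f(s)^T,
    found here by plain power iteration (an illustrative choice).
    """
    m = len(f[0])
    M = [[sum(w * v[i] * v[j] for w, v in zip(mu, f)) for j in range(m)]
         for i in range(m)]
    # Generic start vector; it must not be orthogonal to the top eigenvector.
    u = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(M[i][j] * u[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        u = [x / norm for x in w]
    # Rayleigh quotient of the normalized iterate.
    return sum(u[i] * M[i][j] * u[j] for i in range(m) for j in range(m))

def sup_norm(f):
    """l-infinity norm: the largest Euclidean length |f(s)| over all states."""
    return max(math.sqrt(sum(x * x for x in v)) for v in f)

# Hypothetical example: four equally likely states, f with values in R^2, E f = 0.
mu = [0.25, 0.25, 0.25, 0.25]
f = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5), (0.0, -0.5)]
sigma2 = principal_variance(mu, f)
L = sup_norm(f)
print(sigma2, L ** 2)
```

In this example the principal variance is attained in the first coordinate direction, and it stays below the squared $l^\infty$-norm, as the inequality above requires.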

The behavior of the partial sums $S_N$ also depends on the initial distribution 
$\mu^{(0)}$. It is convenient to use the following measure of the distance 
between the initial and the invariant distribution: 

    $|||\mu^{(0)}/\mu||| =: \left(\mathbb{E}\left[(\mu^{(0)}_s/\mu_s)^2\right]\right)^{1/2} = \left(\sum_{s\in\mathbb{S}} (\mu^{(0)}_s)^2/\mu_s\right)^{1/2}.$ 

Here is the main result. 



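Theorem 1. Suppose (1) $P$ is reversible with spectral gap $g$, (2) $\mathbb{E}f = 0$, 
(3) $\|f\|_\infty \le L$ and (4) $\sigma^2(f) \le \sigma^2$. For arbitrary $\varepsilon > 0$, 

(1)    $\Pr\{|S_N| \ge \varepsilon N\} \le 3\,|||\mu^{(0)}/\mu|||\,2^{m/2}\, e^{-(1/(8k))\varepsilon^2 N},$ 

where 

    $k = \sigma^2\left(\frac{1}{2} + \frac{1}{g}\right) + \frac{192\, g L^2}{125 \log^2[1 + g/2]}.$ 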



In view of the inequality $\sigma^2(f) \le L^2$, we can take $\sigma^2 = L^2$ and obtain the 
following estimate that involves only $L$. 

Corollary 2. Under the assumptions of Theorem 1, 

    $\Pr\{|S_N| \ge \varepsilon N\} \le 3\,|||\mu^{(0)}/\mu|||\,2^{m/2} \exp\left(-\frac{\varepsilon^2}{aL^2} N\right),$ 

where 

    $a = 4 + \frac{8}{g} + \frac{1536\, g}{125 \log^2[1 + g/2]}.$ 



Remarks. 1. Recall that one form of the Bernstein-Hoeffding inequality 
for i.i.d. and one-dimensional variables is 

(2)    $\Pr\{|S_N| \ge \varepsilon N\} \le 2\exp\left(-\frac{\varepsilon^2}{2L^2} N\right)$ 

(see, e.g., [10], Theorem 2). This inequality has the same form as the 
inequality we formulated in Corollary 2, but a better exponential rate. For 
Markov chains and one-dimensional functions $f$, Gillman [9] showed that if 
$\|f\|_\infty \le 1$, then 

(3)    $\Pr\{S_N \ge \varepsilon N\} \le 2\,\|\mu^{(0)}/\mu\| \exp\left(-\frac{g}{20\nu}\varepsilon^2 N\right),$ 

where $\nu$ is the spread of $P$, that is, $\nu = \max(\mu)/\min(\mu)$. The inequality in 
Theorem 1 generalizes (3) to the case of multidimensional functions $f$. 

2. For a fixed $m$, the probability of large deviations declines exponentially 
in $N$, with exponent at most $-(8k)^{-1}\varepsilon^2 N$. Note that this bound on the rate does 
not depend on the dimension of the Euclidean space where $f$ takes its values. 
However, the dimension can significantly affect the term before the 
exponential, which grows exponentially in $m$. 

Table 1 
Sample size needed to ensure that $\Pr\{|S_N/N| \ge 0.01\} \le 5\%$ 

                            Complete graph        Hypercube             Circle 
Method                      m = 1     m = 20      m = 1     m = 20      m = 1     m = 20 
Theorem 1                   4 mln     9 mln       9 mln     22 mln      560 mln   960 mln 
Martingale inequality ¹     280 mln   -           280 mln   -           300 mln   - 
Gillman                     0.7 mln   26 mln      2 mln     80 mln      160 mln   2,640 mln 

¹ While [11] derive bounds for vector-valued martingales, they do not provide explicit 
constants for their inequalities. 



Examples. In the following examples, we study random walks on graphs. 
We will assume that $\mathbb{E}(f) = 0$ and $L = \sigma^2 = 1$. We ask how large $N$ should 
be to ensure that the following inequality holds: 

    $\Pr\{|S_N/N| \ge 0.01\} \le 0.05.$ 

We will consider three examples: a complete graph, a hypercube and a circle. 
We will set the number of vertices equal to 32 in all examples to make them 
comparable. (In the example with the circle, we use 33 vertices to ensure 
that the chain is aperiodic.) We will also assume that the random walks 
start from the uniform distribution. The results are collected in Table 1. 

Example 3. Random walk on a complete graph. The most connected of 
all graphs is the complete graph, where each vertex is connected with each 
of the other vertices. We consider a random walk on a complete graph with 
$n = 32$ vertices. The spectral gap for this random walk is $n/(n-1) = 1 + 1/31$ 
(see [1] for derivation). 

Example 4. Random walk on a hypercube. Let the state space be the 
set of vertices of a 5-dimensional hypercube. With probability 5/6, the next 
state will be one of the 5 adjacent vertices and with probability 1/6, it 
remains the same. The spectral gap is g = 2/(5 + 1) = 1/3 (see [5] or [19]). 

Example 5. Random walk on a circle. We also consider a random walk 
on a circle that consists of $n = 33$ states. If the current state is $x \in \{1, \ldots, n\}$, 
then the next state is $x \pm 1 \bmod n$, with probability 1/2 on each possibility. 
The spectral gap is $g = 1 - \cos(\pi/n) \approx 0.0045$ (see [5] or [19]). 
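
To make the dependence on the spectral gap concrete, the following Python sketch solves the bound of Corollary 2 and Gillman's inequality (3) for the smallest $N$ that drives the right-hand side below a target level. It is only a rough check under the stated assumptions ($L = \sigma^2 = 1$, the walk starts from the uniform distribution, which is invariant for these three walks, so that the distance factor and the spread both equal 1); the function names are hypothetical.

```python
import math

def gap_complete(n):
    """Spectral gap n/(n-1) of the walk on the complete graph (Example 3)."""
    return n / (n - 1)

def gap_hypercube(d):
    """Spectral gap 2/(d+1) of the lazy walk on the d-dimensional hypercube (Example 4)."""
    return 2 / (d + 1)

def gap_circle(n):
    """Spectral gap 1 - cos(pi/n) of the walk on the n-cycle (Example 5)."""
    return 1 - math.cos(math.pi / n)

def corollary2_constant(g):
    """The constant a = 4 + 8/g + 1536 g / (125 log^2(1 + g/2))."""
    return 4 + 8 / g + 1536 * g / (125 * math.log(1 + g / 2) ** 2)

def n_corollary2(g, m, eps=0.01, delta=0.05, L=1.0, ratio=1.0):
    """Smallest N with 3 * ratio * 2^(m/2) * exp(-eps^2 N / (a L^2)) <= delta."""
    a = corollary2_constant(g)
    return a * L ** 2 * math.log(3 * ratio * 2 ** (m / 2) / delta) / eps ** 2

def n_gillman(g, eps=0.01, delta=0.05, ratio=1.0, spread=1.0):
    """Smallest N with 2 * ratio * exp(-g eps^2 N / (20 * spread)) <= delta (m = 1)."""
    return 20 * spread * math.log(2 * ratio / delta) / (g * eps ** 2)

gaps = {
    "complete graph": gap_complete(32),
    "hypercube": gap_hypercube(5),
    "circle": gap_circle(33),
}
for name, g in gaps.items():
    print(name, round(n_corollary2(g, m=1)), round(n_gillman(g)))
```

For $m = 1$ the computed sample sizes agree, up to rounding, with the corresponding entries of Table 1.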

We consider two dimensions, $m = 1$ and $m = 20$, and three methods. The 
first is from our Theorem 1, the second is given by Gillman's inequality, 
modified to make it applicable to multidimensional situations, and the third is 
the method of reduction to martingale inequalities. The following is a sketch 
of the third method in its application to a random walk on an $n$-vertex graph. 
Assume that the walk has been started from the invariant distribution. We 
can define $F_k = \mathbb{E}(S_N \mid s_1, \ldots, s_k)$, that is, the expectation of the sum $S_N$ 
conditional on the first $k$ realizations of the chain. Then $F_1, \ldots, F_N$ form a 
martingale and $F_N = S_N$. For the application of the Bernstein inequality for 
martingale sequences, we need an estimate on $|F_k - F_{k-1}|$. Using coupling 
arguments, it is possible to show that $|F_k - F_{k-1}|$ is less than $2(n-1)L$, 
where $n$ is the number of vertices in the graph. Therefore, for $m = 1$ we 
have the Bernstein inequality 

(4)    $\Pr\{|S_N| \ge \varepsilon N\} \le 2\exp\left(-\frac{1}{2}\,\frac{\varepsilon^2}{[2(n-1)L]^2}\, N\right)$ 

and for $m > 1$, similar inequalities are given by Kallenberg and Sztencel [11] 
(without explicit constants). Note that this method ignores how well the 
chain mixes and uses only the size of the graph to bound the probability of 
a large deviation. 

Table 1 shows that Gillman's inequality provides the best bounds for 
$m = 1$, but performs worse than the bound in Theorem 1 for $m = 20$. The 
martingale inequality underperforms other methods for both the complete 
graph and hypercube, but is better than the bound in Theorem 1 for the case 
of the circle. This leads us to the conclusion that the bound in Theorem 1 
is most effective for large dimensions and well-connected graphs for which 
the spectral gap is large. 

To put the problem in perspective, we shall sketch a history of the question. 
Apparently, the first version of a large deviation inequality for sums of 
i.i.d. random variables was proved by Bernstein in 1924 (see Paper 5 in [3]). 
Later, Bernstein's result was significantly clarified and improved by 
Kolmogoroff [13], Chernoff [4], Prokhorov [17], Bennett [2] and Hoeffding [10]. 
In addition, Hoeffding [10] showed how the inequality can be extended to 
some classes of dependent variables and, in particular, to the case of 
martingale differences. Prokhorov [18] proved the multidimensional analogue of 
the Bernstein inequality for i.i.d. random variables. The multidimensional 
analogue was also derived by Yurinskii [20] by a different method which is 
applicable to the case of random variables that take values in an 
infinite-dimensional Banach space. Later, the multidimensional large-deviation 
inequalities were generalized to the case of martingale sequences in [11], where 
it was shown that a martingale process with values in a Hilbert space can be 
represented by a martingale process that takes values in the plane $\mathbb{R}^2$. This device 
allows reduction of the question of large deviations in many dimensions to 
the question of large deviations for two-dimensional martingale processes. 

For functions defined on the state-space of a finite Markov chain, large 
deviations were first studied by Miller [15]. Very definitive and general results 
in this direction were later obtained by Donsker and Varadhan [7]. They 
established the existence of the exponential rate of the decline in the proba- 
bility of large deviations and showed how to compute this rate. Their results 
are valid for vector-valued or even measure-valued functionals of Markov 
chains acting on very general state spaces. While results of this type are 
very useful for understanding the asymptotic behavior of large deviations, 
they do not provide explicit bounds on the probability of a large deviation 
in a finite sample. 

The first one-dimensional Bernstein-type inequality for finite Markov chains 
was proved by Gillman [9] (see also [6] and [14] for significant improvements). 
Gillman's method is to write 

    $\Pr\{S_N \ge \varepsilon N\} \le \mathbb{E}^{(0)}\exp(-\theta\varepsilon N + \theta S_N)$ 
    $= \exp(-\theta\varepsilon N) \sum_{s_0,\ldots,s_N} \mu^{(0)}_{s_0} P_{s_0 s_1} e^{\theta f(s_1)} \cdots P_{s_{N-1} s_N} e^{\theta f(s_N)}$ 
    $= \exp(-\theta\varepsilon N)\,(\mu^{(0)}, [P(\theta)]^N 1_{\mathbb{S}}),$ 

where $P_{st}$ denotes the transition probability from state $s$ to $t$, $P(\theta)$ is a 
matrix with entries $P_{st}(\theta) = P_{st} e^{\theta f(t)}$, $\mu^{(0)}$ is the initial distribution, $1_{\mathbb{S}}$ is a 
function that takes value 1 on every state of $\mathbb{S}$ and $(\cdot,\cdot)$ denotes a scalar 
product for functions on $\mathbb{S}$. It turns out that $P(\theta)$ is similar to a symmetric 
matrix and therefore its norm can be bounded in terms of its eigenvalues. 
Therefore, the main task is to estimate the eigenvalues of $P(\theta)$, which can 
be done using Kato's theory of linear operator perturbations. Dinwoodie 
[6] and Lezaud [14] use a similar method and improve upon Gillman by 
employing more sophisticated and difficult versions of perturbation theory. 
Prior to Gillman, the method of a perturbed transition kernel was used by 
Nagaev [16] to study central limit theorems for Markov chains. 

Obviously, Gillman's method is not directly applicable to the case of vector 
functions since we cannot develop $\mathbb{E}\exp(-\theta\varepsilon N + \theta\|S_N\|)$ in the sum 
of products of $\exp\|f(s)\|$. To circumvent this difficulty, we use an idea of 
Prokhorov [18], which was used to prove the multidimensional analogue of 
the Bernstein inequality for i.i.d. variables. The idea is to consider 
$\mathbb{E}\exp(-\theta\varepsilon N + \theta\langle S_N, u\rangle)$, where $u$ is a random vector from an appropriate 
distribution, and later integrate it over the distribution of $u$. The advantage 
is that $\mathbb{E}\exp(-\theta\varepsilon N + \theta\langle S_N, u\rangle)$ can be developed as the sum of products of 
$\exp\langle f(s), u\rangle$. Using this idea we are able to extend the Bernstein-Gillman 
inequality to vector functions. 

A large body of related literature studies the explicit rates of convergence 
of a Markov chain to its invariant distribution. For a review, see the book by 
Diaconis [5], the review paper by Saloff-Coste [19] and the dissertation by 
Gangolli [8]. Our problem is of a somewhat different flavor because, even for 
a chain which starts in the invariant distribution, the problem of estimating 
the probability of a large deviation of the function sum is not trivial. 

The rest of the paper is devoted to the proof of the main result. It is 
organized as follows. Section 2 gives an outline of the proof and explicates the 
relation of our problem to the eigenvalue problem for a perturbed transition 
matrix. Section 3 applies a mixture of techniques from the Rellich and Kato 
perturbation theories to estimate the largest eigenvalue of the perturbed 
transition matrix. Section 4 concludes. 

2. Outline of the proof. Let 

    $F_r(x) = \int \exp\langle x, u\rangle\, d\Phi(u),$ 

where $x$ and $u$ are vectors from an $m$-dimensional real Euclidean space and 
$d\Phi(u)$ is the Gaussian measure with density 

    $\phi(u) = \frac{1}{(2\pi r^2)^{m/2}} \exp\left(-\frac{|u|^2}{2r^2}\right).$ 

We can easily calculate $F_r(x)$ explicitly: 

    $F_r(x) = e^{(r^2/2)|x|^2}.$ 
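
The closed form for $F_r(x)$ can be checked numerically in the simplest case $m = 1$. The sketch below is an illustration only: it compares a trapezoidal approximation of the Gaussian integral with $e^{(r^2/2)x^2}$, and the function names, the integration window and the step count are arbitrary choices.

```python
import math

def gaussian_mgf_numeric(x, r, half_width=12.0, steps=20000):
    """Trapezoidal approximation of the integral of exp(x*u) against the
    N(0, r^2) density, for scalar x and u (the case m = 1)."""
    # The integrand exp(x*u - u^2 / (2 r^2)) peaks at u = x * r^2.
    center = x * r * r
    a = center - half_width * r
    b = center + half_width * r
    h = (b - a) / steps

    def integrand(u):
        return math.exp(x * u - u * u / (2 * r * r)) / math.sqrt(2 * math.pi * r * r)

    total = 0.5 * (integrand(a) + integrand(b))
    for i in range(1, steps):
        total += integrand(a + i * h)
    return total * h

def gaussian_mgf_closed_form(x, r):
    """The closed form exp((r^2 / 2) * x^2)."""
    return math.exp(0.5 * r * r * x * x)

x, r = 1.5, 0.7  # hypothetical values
print(gaussian_mgf_numeric(x, r), gaussian_mgf_closed_form(x, r))
```

The same identity in dimension $m$ follows by applying the one-dimensional computation to each coordinate, since the Gaussian density factorizes.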

Consequently, we can write 

    $\Pr\{|S_N| \ge \varepsilon N\} = \Pr\{e^{(r^2/2)|S_N|^2} \ge e^{(r^2/2)|\varepsilon N|^2}\}$ 
    $\le e^{-(r^2/2)|\varepsilon N|^2}\, \mathbb{E}^{(0)}[e^{(r^2/2)|S_N|^2}]$ 
(5) $= e^{-(r^2/2)|\varepsilon N|^2}\, \mathbb{E}^{(0)}\left[\int \exp\langle S_N, u\rangle\, d\Phi(u)\right]$ 
    $= e^{-(r^2/2)|\varepsilon N|^2} \int [\mathbb{E}^{(0)} \exp\langle S_N, u\rangle]\, d\Phi(u).$ 

Consider now $\mathbb{E}^{(0)}\exp\langle S_N, u\rangle$. We will write this expression as a quadratic 
form and show that what matters is the largest eigenvalue of this form. We 
will then show that a sufficiently good estimate on the eigenvalue would 
imply the inequality in Theorem 1. The derivation of the eigenvalue estimate 
is given in the next section. 

Define the perturbed transition matrix as a matrix with the following 
entries: 

    $P_{st}(u) = P_{st}\, e^{\langle f(t), u\rangle}.$ 

We denote its largest eigenvalue by $\lambda_0(u)$. Let $(\cdot,\cdot)$ denote the scalar product 
$(a, b) = \sum_s a_s b_s$, where $s$ are states of the chain and $a_s$ and $b_s$ are 
scalar-valued functions of $s$. Also, let $1_{\mathbb{S}}$ denote the scalar-valued function that 
takes the value 1 on all states. 

Lemma 6. 

    $\mathbb{E}^{(0)}\exp\langle S_N, u\rangle = (\mu^{(0)}, [P(u)]^N 1_{\mathbb{S}}).$ 

Proof. We can write 

    $\mathbb{E}^{(0)}\exp\langle S_N, u\rangle = \sum_{s_0, s_1, \ldots, s_N} \mu^{(0)}_{s_0} P_{s_0 s_1} e^{\langle f(s_1), u\rangle} \cdots P_{s_{N-1} s_N} e^{\langle f(s_N), u\rangle}$ 
    $= (\mu^{(0)}, [P(u)]^N 1_{\mathbb{S}}).$ 

□ 
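
The identity of Lemma 6 can be verified by brute force on a tiny chain. The following Python sketch enumerates all paths $s_0, \ldots, s_N$ and compares the resulting expectation with the matrix expression; the three-state chain, the values of $f$ and $u$, and the function names are hypothetical and serve only as an illustration.

```python
import itertools
import math

def dot(a, b):
    """Euclidean scalar product of two vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))

def expectation_by_paths(mu0, P, f, u, N):
    """E^(0) exp<S_N, u>, summing over all paths s_0, ..., s_N,
    where S_N = f(s_1) + ... + f(s_N) and s_0 is drawn from mu0."""
    n = len(P)
    total = 0.0
    for path in itertools.product(range(n), repeat=N + 1):
        prob = mu0[path[0]]
        for a, b in zip(path, path[1:]):
            prob *= P[a][b]
        total += prob * math.exp(sum(dot(f[s], u) for s in path[1:]))
    return total

def expectation_by_matrix(mu0, P, f, u, N):
    """(mu0, P(u)^N 1) with P(u)[s][t] = P[s][t] * exp(<f(t), u>)."""
    n = len(P)
    Pu = [[P[s][t] * math.exp(dot(f[t], u)) for t in range(n)] for s in range(n)]
    v = [1.0] * n
    for _ in range(N):
        v = [sum(Pu[s][t] * v[t] for t in range(n)) for s in range(n)]
    return sum(mu0[s] * v[s] for s in range(n))

# Hypothetical three-state example with f taking values in R^2.
P = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
mu0 = [0.6, 0.3, 0.1]
f = [(0.8, 0.0), (-0.4, 0.5), (-0.4, -0.5)]
u = (0.3, -0.2)
print(expectation_by_paths(mu0, P, f, u, 4), expectation_by_matrix(mu0, P, f, u, 4))
```

The path enumeration costs $|\mathbb{S}|^{N+1}$ operations, while the matrix form needs only $N$ matrix-vector products; this is what makes the eigenvalue approach tractable.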



A fortunate consequence of the reversibility of P is that matrices P and 
P(u) become symmetric in a coordinate system with dilated axes. This im- 
plies that matrices P and P (u) enjoy all of the good properties of symmetric 
matrices and, in particular, that their eigenvalues are real and their norms 
can be expressed in terms of the eigenvalue with the largest absolute value. 

The second instance of good luck is that both $P$ and $P(u)$ are nonnegative 
in the sense that all of their entries are nonnegative. This implies that the 
Perron-Frobenius theorem is applicable and we can pinpoint which of the 
eigenvalues has the largest absolute value. As we might expect, the largest 
eigenvalue has the largest absolute value. As a consequence, we are able to 
estimate the norm of $P(u)$ in terms of its largest eigenvalue and therefore 
obtain a bound on the value of $(\mu^{(0)}, [P(u)]^N 1_{\mathbb{S}})$. 

Lemma 7. Let $D = \mathrm{diag}\{\sqrt{\mu_s}\}$ and $E_u = \mathrm{diag}\{\exp\frac{1}{2}\langle f(s), u\rangle\}$. Define 
$S =: DPD^{-1}$ and $S_u =: E_u S E_u$. Then (1) $S$ and $S_u$ are symmetric, (2) $S_u$ is 
similar to $P(u)$ and has the same eigenvalues as $P(u)$, (3) the eigenvalues of 
$P(u)$ are real and (4) the largest eigenvalue of $P(u)$ has the largest absolute 
value among all eigenvalues of $P(u)$. 

Remark. Here, $S$ and $E_u$ denote matrices and should not be confused 
with the notation for the state space of the Markov chain, $\mathbb{S}$, and for the 
expectation value, $\mathbb{E}$, respectively. 

Proof of Lemma 7. First, the reversibility of $P$ implies that $S =: DPD^{-1}$ 
is symmetric. Indeed, 

    $S_{ij} = \mu_i^{1/2} P_{ij} \mu_j^{-1/2} = \mu_i^{-1/2}(\mu_i P_{ij})\mu_j^{-1/2} = \mu_i^{-1/2}(\mu_j P_{ji})\mu_j^{-1/2} = \mu_j^{1/2} P_{ji} \mu_i^{-1/2} = S_{ji}.$ 

Then $S_u = E_u S E_u$ is symmetric because $E_u$ is symmetric. It is similar to 
$P(u)$ because 

    $P(u) = P E_u^2 = D^{-1} S D E_u^2$ 
    $= D^{-1} E_u^{-1} (E_u S E_u) E_u D$ 
    $= (E_u D)^{-1} S_u (E_u D),$ 

where we have used the commutativity of $D$ and $E_u$. Consequently, $S_u$ and 
$P(u)$ have the same eigenvalues. The eigenvalues of $S_u$ are real because $S_u$ 
is symmetric. Therefore, the eigenvalues of $P(u)$ are also real. Finally, $P(u)$ 
has nonnegative entries and therefore, by the Perron-Frobenius theorem, its 
largest eigenvalue has the largest absolute value. □ 
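
The symmetrization in Lemma 7 is easy to check numerically. The sketch below builds a reversible chain as a random walk on a small weighted graph (a hypothetical example), forms $S = DPD^{-1}$ and $S_u = E_u S E_u$, and measures how far each matrix is from being symmetric. All names and numerical values are illustrative assumptions.

```python
import math

def dot(a, b):
    """Euclidean scalar product of two vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))

def symmetrized(P, mu):
    """S = D P D^{-1} with D = diag(sqrt(mu_s))."""
    n = len(P)
    return [[math.sqrt(mu[s]) * P[s][t] / math.sqrt(mu[t]) for t in range(n)]
            for s in range(n)]

def symmetrized_perturbed(P, mu, f, u):
    """S_u = E_u S E_u with E_u = diag(exp(<f(s), u> / 2))."""
    n = len(P)
    S = symmetrized(P, mu)
    e = [math.exp(0.5 * dot(f[s], u)) for s in range(n)]
    return [[e[s] * S[s][t] * e[t] for t in range(n)] for s in range(n)]

def max_asymmetry(A):
    """Largest |A[i][j] - A[j][i]|; zero for a symmetric matrix."""
    n = len(A)
    return max(abs(A[i][j] - A[j][i]) for i in range(n) for j in range(n))

# Random walk on a weighted graph: symmetric weights give a reversible chain
# whose invariant distribution is proportional to the row sums of W.
W = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 1.0]]
row = [sum(r) for r in W]
P = [[W[s][t] / row[s] for t in range(3)] for s in range(3)]
mu = [x / sum(row) for x in row]
f = [(0.6, 0.0), (-0.3, 0.4), (0.0, -0.5)]
u = (0.5, -0.25)

S = symmetrized(P, mu)
S_u = symmetrized_perturbed(P, mu, f, u)
P_u = [[P[s][t] * math.exp(dot(f[t], u)) for t in range(3)] for s in range(3)]
print(max_asymmetry(P), max_asymmetry(S), max_asymmetry(S_u))
```

As a weak consistency check on the similarity of $S_u$ and $P(u)$, the two matrices should also have equal traces.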

Lemma 8. If the chain $P$ is reversible, $|u| \le 1$ and $|f(s)| \le 1$ for any $s$, 
then 

    $(\mu^{(0)}, [P(u)]^N 1_{\mathbb{S}}) \le 3\,|||\mu^{(0)}/\mu|||\,\lambda_0(u)^N.$ 

Proof. Since $S_u$ is symmetric and its largest eigenvalue has the largest 
absolute value, $\|S_u\| \le \lambda_0(u)$. Therefore, 

    $(\mu^{(0)}, [P(u)]^N 1_{\mathbb{S}}) = (\mu^{(0)}(E_u D)^{-1}, S_u^N (E_u D) 1_{\mathbb{S}})$ 
    $\le \lambda_0(u)^N\, \|\mu^{(0)}(E_u D)^{-1}\|\, \|(E_u D) 1_{\mathbb{S}}\|,$ 

where $\|\cdot\|$ denotes the norm corresponding to the scalar product $(\cdot,\cdot)$. 
Then 

    $\|\mu^{(0)}(E_u D)^{-1}\| = \left[\sum_s \frac{(\mu_s^{(0)})^2}{\mu_s} \exp(-\langle f(s), u\rangle)\right]^{1/2} \le \sqrt{3}\,|||\mu^{(0)}/\mu|||,$ 

where we have used the fact that $|\langle f(s), u\rangle| \le |f(s)||u| \le 1$ and consequently 
$\exp(\pm\langle f(s), u\rangle) \le 3$. Similarly, 

    $\|(E_u D) 1_{\mathbb{S}}\| = \left[\sum_s \mu_s \exp\langle f(s), u\rangle\right]^{1/2} \le \sqrt{3}.$ 

Combining, we get 

    $(\mu^{(0)}, [P(u)]^N 1_{\mathbb{S}}) \le 3\,|||\mu^{(0)}/\mu|||\,\lambda_0(u)^N.$ □ 

Suppose, for the moment, that we have managed to establish the inequality 

    $\lambda_0(u) \le \exp(k|u|^2).$ 

Then, using Lemmas 6 and 8, we can write 

    $\int [\mathbb{E}^{(0)}\exp\langle S_N, u\rangle]\, d\Phi(u) \le 3\,|||\mu^{(0)}/\mu||| \int \lambda_0(u)^N\, d\Phi(u)$ 
    $\le 3\,|||\mu^{(0)}/\mu||| \int \exp(k|u|^2 N)\, d\Phi(u)$ 
    $= \frac{3\,|||\mu^{(0)}/\mu|||}{(2\pi r^2)^{m/2}} \int \exp(k|u|^2 N)\, e^{-|u|^2/(2r^2)}\, du.$ 

In spherical coordinates, with the radial variable scaled as $|u| = rs$, we can 
rewrite this expression as follows: 

    $\frac{3\,|||\mu^{(0)}/\mu|||\, m}{2^{m/2}\,\Gamma(m/2 + 1)} \int_0^\infty s^{m-1} e^{k r^2 s^2 N - s^2/2}\, ds,$ 

where we use the fact that the surface area of the unit sphere in $m$-dimensional 
real Euclidean space is $m\pi^{m/2}/\Gamma((m/2) + 1)$. Next, set 

(6)    $r = (2\sqrt{kN})^{-1}.$ 

Then 

    $\int_0^\infty s^{m-1} e^{k r^2 s^2 N} e^{-s^2/2}\, ds = \int_0^\infty s^{m-1} e^{-s^2/4}\, ds.$ 

Making the substitution $t = s^2/4$, we compute 

    $\int_0^\infty s^{m-1} e^{-s^2/4}\, ds = 2^{m-1} \int_0^\infty t^{m/2 - 1} e^{-t}\, dt = 2^{m-1}\,\Gamma(m/2).$ 

So, combining, we obtain 

    $\int [\mathbb{E}^{(0)}\exp\langle S_N, u\rangle]\, d\Phi(u) \le \frac{3\,|||\mu^{(0)}/\mu|||\, m\, 2^{m-1}\,\Gamma(m/2)}{2^{m/2}\,\Gamma(m/2 + 1)} = 3\,|||\mu^{(0)}/\mu|||\,2^{m/2}.$ 

Substituting this and (6) into (5), we obtain 

    $\Pr\{|S_N| \ge \varepsilon N\} \le 3\,|||\mu^{(0)}/\mu|||\,2^{m/2}\, e^{-(1/(8k))\varepsilon^2 N},$ 

which is the desired inequality. 
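
The two computational steps used above, the value of the radial integral and the simplification of the Gamma-function prefactor to $2^{m/2}$, can be checked numerically. The following Python sketch is an illustration only; the function names, the truncation point and the step count are arbitrary choices.

```python
import math

def radial_integral(m, upper=40.0, steps=60000):
    """Trapezoidal approximation of the integral of s^(m-1) * exp(-s^2/4)
    over [0, upper]; the tail beyond `upper` is negligible."""
    h = upper / steps

    def integrand(s):
        return s ** (m - 1) * math.exp(-s * s / 4.0)

    total = 0.5 * (integrand(0.0) + integrand(upper))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h

def radial_closed_form(m):
    """The value 2^(m-1) * Gamma(m/2) obtained from the substitution t = s^2/4."""
    return 2 ** (m - 1) * math.gamma(m / 2)

def prefactor(m):
    """m * 2^(m-1) * Gamma(m/2) / (2^(m/2) * Gamma(m/2 + 1))."""
    return m * 2 ** (m - 1) * math.gamma(m / 2) / (2 ** (m / 2) * math.gamma(m / 2 + 1))

for m in (1, 2, 5, 20):
    print(m, radial_integral(m), radial_closed_form(m), prefactor(m), 2 ** (m / 2))
```

The simplification of the prefactor relies only on the identity $\Gamma(m/2 + 1) = (m/2)\Gamma(m/2)$.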

In the above, we have assumed that $\|f\|_\infty \le 1$. In the general case when 
$\|f\|_\infty \le L$, we simply introduce the auxiliary function $g = f/L$. Then 

    $\Pr\{|f_1 + \cdots + f_N| \ge \varepsilon N\} = \Pr\{|g_1 + \cdots + g_N| \ge (\varepsilon/L) N\}$ 

and the latter probability can be estimated if we observe that $\|g\|_\infty \le 1$ and 
$\sigma^2(g) = \sigma^2(f)/L^2$. 

It remains to derive the required estimate on the eigenvalue $\lambda_0(u)$. 

3. A bound on the largest eigenvalue of the perturbed transition matrix. 

We need to estimate the largest eigenvalue of the perturbed transition matrix 
$P(u) = P\,\mathrm{diag}(\exp\langle f(t), u\rangle)$. In the following, we use the notation $P(z) = 
P(zu)$, where $u$ is a fixed vector of length 1. Our main concern will be real 
values of $z$ which lie in the interval $[0, \infty)$, but we will also need to consider 
the complex values of $z$. It is known that if $\lambda_i$ is an eigenvalue of $P$ of 
multiplicity 1, then there is a complex-analytic function $\lambda_i(z)$ defined in a 
neighborhood of $z = 0$ such that $\lambda_i(z)$ is an eigenvalue of $P(z)$. This function 
is called the perturbation of the eigenvalue $\lambda_i$. We will consider this function 
for $i = 0$. 

It will be clear from the following discussion that for all sufficiently small 
$z$, say, for $|z| \le r$, there exists a circle around $\lambda_0(z)$ such that $P(z)$ has 
no eigenvalues in this circle except $\lambda_0(z)$ itself. Since for real positive $z$, 
the largest eigenvalue of $P(z)$ must be real and positive (by the 
Perron-Frobenius theorem) and since initially at $z = 0$, $\lambda_0$ is the largest eigenvalue, 
we can conclude by continuity that when $z$ changes from zero to $r$ along 
the real line, the largest eigenvalue of $P(z)$ remains $\lambda_0(z)$. Therefore, for this 
range of $z$, the desired estimate for the largest eigenvalue of $P(z)$ follows 
from an appropriate estimate for $\lambda_0(z)$. This estimate will be obtained from 
Kato perturbation theory. For larger values of the perturbation parameter 
$z$, we will use a different method which bounds all eigenvalues of $P(z)$ at 
once. 

We know that $\lambda_0(0) = 1$ and it is easy to show that $\lambda_0'(0) = 0$. It is also 
relatively easy to bound the second derivative of $\lambda_0(z)$ at $z = 0$. It is 
somewhat more difficult to estimate the remainder $\lambda_0(z) - 1 - \frac{1}{2}\lambda_0''(0)z^2$ in an open 
neighborhood of $z = 0$. We will establish an estimate by studying the 
resolvent of the perturbed operator in the complex $z$-plane (the Kato method, 
see [12]). 

For convenience, we shall call the following set of conditions Assump- 
tion A: 

1. $P$ is a reversible chain with spectral gap $g$; 

2. $\mathbb{E}f(s) = 0$; 

3. The principal variance of $f$ is $\sigma^2$; 

4. $|f(s)| \le 1$ for each $s$. 

In the following, we always suppose that Assumption A holds. The main 
result of this section is the following estimate. 

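Proposition 9. 

    $\lambda_0(u) \le e^{k|u|^2},$ 

where 

(7)    $k = \sigma^2\left(\frac{1}{2} + \frac{1}{g}\right) + \frac{192\, g}{125 \log^2[1 + g/2]}.$ 

First, we estimate $\lambda_0'(0)$ and $\lambda_0''(0)$. 

Lemma 10. $\lambda_0'(0) = 0$. 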

Proof. Matrix $P(z)$ can be developed as a power series in $z$: 

(8)    $P(z) = P\left(\sum_{n=0}^{\infty} \frac{1}{n!} z^n V^n\right),$ 

where 

    $V = \mathrm{diag}\{\langle f(t), u\rangle\}.$ 

Let the expansions for $\lambda_0(z)$ and the corresponding eigenvector, $X(z)$, be 

    $\lambda_0(z) = 1 + \lambda_0'(0)z + \frac{1}{2}\lambda_0''(0)z^2 + \cdots,$ 
    $X(z) = \mu + X'(0)z + \frac{1}{2}X''(0)z^2 + \cdots.$ 

Writing the equality $X(z)P(z) = \lambda_0(z)X(z)$ in powers of $z$, we obtain 

    $\mu P = \mu,$ 
(9)    $X'(0)P + \mu PV = \lambda_0'(0)\mu + X'(0).$ 

Multiply the last line by $1_{\mathbb{S}}$ on the right and use the facts that $P1_{\mathbb{S}} = 1_{\mathbb{S}}$ 
and $\mu 1_{\mathbb{S}} = 1$. (Recall that $1_{\mathbb{S}}$ is a scalar-valued function that takes the value 
1 on all states.) We then obtain 

    $\lambda_0'(0) = \mu V 1_{\mathbb{S}}.$ 

However, 

(10)    $\mu V 1_{\mathbb{S}} = \sum_s \mu_s \langle f(s), u\rangle$ 
(11)    $= \langle \mathbb{E}f, u\rangle = 0,$ 

by assumption. Therefore, $\lambda_0'(0) = 0$. □ 

We also require some information about the perturbation of the eigenvector, 
in particular, about $X'(0)$. From (9), $X'(0)$ must satisfy the following 
equation: 

(12)    $X'(0)(I - P) = \mu V.$ 

It is tempting to write $X'(0) = \mu V(I - P)^{-1}$. However, $I - P$ is not invertible, 
which is reflected, for example, in the fact that if a vector $X'$ satisfies 
equation (12), then $X' + a\mu$ also satisfies it. We need to impose one additional 
constraint to determine the solution. We choose a normalization in which 
$X'(0)$ is the unique solution of (12) that satisfies the additional constraint 
that $(X'(0), 1_{\mathbb{S}}) = 0$. 

To solve (12), we need a pseudo-inverse of $I - P$. The traditional 
pseudo-inverse is not appropriate because, first, $P$ is not symmetric and second, we 
use a nonstandard normalization of the solution. An appropriate concept of 
the pseudo-inverse is as follows. 

Let $1_{\mathbb{S}}^{\perp}$ be the subspace of vectors orthogonal to $1_{\mathbb{S}}$. This subspace is 
invariant under the right action of $P$. Indeed, if $x1_{\mathbb{S}} = 0$, then $xP1_{\mathbb{S}} = x1_{\mathbb{S}} = 
0$. We define the pseudo-inverse operator $(I - P)^{\dagger}$ as the inverse of $I - P$ on 
$1_{\mathbb{S}}^{\perp}$ and as $0$ on $1_{\mathbb{S}}$. 

If $P$ is reversible, then $P = D^{-1}SD$, where $S$ is symmetric. Since the 
subspace $1_{\mathbb{S}}^{\perp}$ is invariant under the right action of $P$, the subspace $1_{\mathbb{S}}^{\perp}D^{-1}$ 
is invariant under the right action of $S$ and we can define $(I - S)^{\dagger}$, which 
is the inverse of $I - S$ on $1_{\mathbb{S}}^{\perp}D^{-1}$ and is zero on $1_{\mathbb{S}}D^{-1}$. Note that $(I - S)^{\dagger}$ 
and $S$ commute and that $D^{-1}(I - S)^{\dagger}D = (I - P)^{\dagger}$. 

Lemma 11. $X'(0) = \mu V(I - P)^{\dagger} = \mu V D^{-1}(I - S)^{\dagger}D$. 

Proof. By (10), $\mu V \in 1_{\mathbb{S}}^{\perp}$. Therefore, the product $\mu V(I - P)^{\dagger}$ satisfies 
equation (12) and belongs to $1_{\mathbb{S}}^{\perp}$. Consequently, it coincides with $X'(0)$. □ 

Now, consider the second derivative of the eigenvalue function. 

Lemma 12. $\lambda_0''(0) \le (1 + 2/g)\sigma^2$. 

Proof. Let us equate $z^2$ terms in the expansion of the equality $X(z)P(z) = 
\lambda_0(z)X(z)$, taking into account that $\lambda_0'(0) = 0$ and $\mu P = \mu$: 

    $\frac{1}{2}X''(0)P + X'(0)PV + \frac{1}{2}\mu V^2 = \frac{1}{2}\lambda_0''(0)\mu + \frac{1}{2}X''(0).$ 

Multiplying this equality by $1_{\mathbb{S}}$ on the right and using the fact that $P1_{\mathbb{S}} = 1_{\mathbb{S}}$, 
we obtain the following formula for $\lambda_0''(0)$: 

(13)    $\lambda_0''(0) = \mu V^2 1_{\mathbb{S}} + 2X'(0)PV1_{\mathbb{S}}.$ 

Consider the absolute value of the second term in (13): 

    $|X'(0)PV1_{\mathbb{S}}| = |\mu V D^{-1}(I - S)^{\dagger} S D V 1_{\mathbb{S}}|$ 
    $\le \|(I - S)^{\dagger} S\|\, \|\mu V D^{-1}\|\, \|DV1_{\mathbb{S}}\|,$ 

where we used Lemma 11 and the equality $P = D^{-1}SD$. [Here, we use 
$\|\cdot\|$ to denote both the norm of a function on $\mathbb{S}$ and the norm of an 
operator that acts on these functions: by definition, $\|f\| = (f, f)^{1/2}$ and 
$\|A\| = \sup_{\|f\|=1} \|Af\|$.] 

The operator $(I - S)^{\dagger}S$ is symmetric with eigenvalues which are either 
zeros or $\lambda_i/(1 - \lambda_i)$, where $i \ge 1$. Consequently, 

    $\|(I - S)^{\dagger}S\| \le \frac{1}{g}.$ 

Next, 

    $\|DV1_{\mathbb{S}}\| = \left(\sum_s \mu_s \langle f(s), u\rangle^2\right)^{1/2} \le \sigma$ 

and 

    $\|\mu V D^{-1}\| = \left(\sum_t \mu_t \langle f(t), u\rangle^2\right)^{1/2} \le \sigma,$ 

where we used the fact that $D = \mathrm{diag}\{\sqrt{\mu_i}\}$. Combining, we have 

    $|X'(0)PV1_{\mathbb{S}}| \le \frac{\sigma^2}{g}.$ 

Finally, for the first term on the right-hand side of (13), we have 

    $|\mu V^2 1_{\mathbb{S}}| = \sum_s \mu_s \langle f(s), u\rangle^2 \le \sigma^2$ 

and therefore 

    $\lambda_0''(0) \le \sigma^2\left(1 + \frac{2}{g}\right).$ 

□ 
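
Lemmas 10 and 12 can be illustrated numerically on a small example. The sketch below takes the random walk on the complete graph with three vertices (spectral gap $g = 3/2$ by the formula $n/(n-1)$ of Example 3), a scalar function $f$ with $\mathbb{E}f = 0$ and $|f| \le 1$, and estimates $\lambda_0'(0)$ and $\lambda_0''(0)$ by central finite differences of the largest eigenvalue of $P(z)$. The chain, the function, the step size and all names are hypothetical choices made for illustration.

```python
import math

def perturbed_matrix(P, f, z):
    """P(z) with entries P[s][t] * exp(z * f[t]); f is scalar-valued (m = 1, u = 1)."""
    n = len(P)
    return [[P[s][t] * math.exp(z * f[t]) for t in range(n)] for s in range(n)]

def top_eigenvalue(A, iters=300):
    """Perron eigenvalue of a nonnegative primitive matrix by power iteration.

    The iterate stays positive and is rescaled so that its largest entry is 1,
    so the largest entry of A v converges to the Perron eigenvalue.
    """
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam

# Random walk on the complete graph with 3 vertices; uniform invariant distribution.
P = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
f = [1.0, -0.5, -0.5]                  # E f = 0 and |f| <= 1
g = 1.5                                # spectral gap n/(n-1) for n = 3
sigma2 = sum(x * x for x in f) / 3     # principal variance for m = 1

h = 1e-3
lam_plus = top_eigenvalue(perturbed_matrix(P, f, h))
lam_zero = top_eigenvalue(perturbed_matrix(P, f, 0.0))
lam_minus = top_eigenvalue(perturbed_matrix(P, f, -h))
first = (lam_plus - lam_minus) / (2 * h)                 # estimate of lambda_0'(0)
second = (lam_plus - 2 * lam_zero + lam_minus) / h ** 2  # estimate of lambda_0''(0)
bound = (1 + 2 / g) * sigma2
print(first, second, bound)
```

In this example the estimated second derivative lies well below the bound of Lemma 12, which is not surprising since the bound is designed to hold uniformly over all functions with the given principal variance.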



We now turn to the estimation of the residual $\lambda_0(z) - 1 - \frac{1}{2}\lambda_0''(0)z^2$. The 
following is a quick excursion in Kato's theory of perturbations. The 
resolvent of the perturbed operator $P(z)$ is defined as $R(\zeta, z) = [P(z) - \zeta]^{-1}$. We 
want to estimate the change in eigenvalues of $P(z)$ when $z$ changes. For this 
purpose, we study how the resolvent of $P(z)$ depends on $z$. 

Let us, for economy of space, write 

    $A(z) =: P(z) - P = P(Vz + \frac{1}{2}V^2z^2 + \cdots).$ 

We can write 

    $P(z) - \zeta = P - \zeta + A(z)$ 
    $= (P - \zeta)[1 + R(\zeta)A(z)]$ 

and consequently, 

    $R(\zeta, z) = [1 + R(\zeta)A(z)]^{-1}R(\zeta).$ 

The power series for $[1 + R(\zeta)A(z)]^{-1}$ is 

(14)    $[1 + R(\zeta)A(z)]^{-1} = \sum_{n=0}^{\infty} [-R(\zeta)A(z)]^n.$ 

$R(\zeta, z)$ is nonsingular if this power series is convergent, which holds if 

(15)    $\|R(\zeta)A(z)\|_{sp} < 1,$ 

where $\|\cdot\|_{sp}$ denotes the spectral norm, 

    $\|X\|_{sp} =: \limsup_{n\to\infty} \|X^n\|^{1/n}.$ 

Recall that the reversibility of $P$ implies that it can be represented as 
$P = D^{-1}SD$, where $D = \mathrm{diag}\{\sqrt{\mu_s}\}$ and $S$ is symmetric. Let us denote 
$(S - \zeta)^{-1}$ by $R_S(\zeta)$. 

Lemma 13. The power series (14) for $[1 + R(\zeta)A(z)]^{-1}$ converges if 
$|z| < \log(1 + \|R_S(\zeta)S\|^{-1})$. 

Proof. In our case, the perturbation is 

    $A(z) = P(e^{zV} - 1),$ 

where $V = \mathrm{diag}(\langle f(s), u\rangle)$. By criterion (15), we should determine when 
$\|R(\zeta)P(e^{zV} - 1)\|_{sp} < 1$. For reversible $P$, we can write 

    $R(\zeta)P = (P - \zeta)^{-1}P$ 
    $= D^{-1}(S - \zeta)^{-1}SD.$ 

Using the fact that both $D$ and $(e^{zV} - 1)$ are diagonal and therefore 
commute, we can further write 

    $R(\zeta)P(e^{zV} - 1) = D^{-1}R_S(\zeta)S(e^{zV} - 1)D.$ 

Next, we use the property of the spectral norm that it is not changed by 
similarity transformations and write 

    $\|R(\zeta)P(e^{zV} - 1)\|_{sp} = \|R_S(\zeta)S(e^{zV} - 1)\|_{sp}$ 
    $\le \|R_S(\zeta)S(e^{zV} - 1)\|,$ 

where we also used the fact that the spectral norm is bounded from above 
by the usual operator norm. We can continue as follows: 

    $\|R_S(\zeta)S(e^{zV} - 1)\| \le \|R_S(\zeta)S\| \sum_{k=1}^{\infty} \frac{1}{k!}|z|^k \|V^k\|.$ 

From assumptions on $u$ and $f(s)$, it follows that $\|V\| \le 1$ and consequently, 

    $\|R_S(\zeta)S(e^{zV} - 1)\| \le \|R_S(\zeta)S\|(e^{|z|} - 1).$ 

This expression is less than 1, provided that $|z| < \log(1 + \|R_S(\zeta)S\|^{-1})$. □ 

In the following, it is useful to keep in mind the distinction between the 
$\zeta$-plane, where the spectral parameter $\zeta$ lives, and the $z$-plane, where the 
perturbation parameter $z$ lives. 

Lemma 14. Let $\Gamma$ be a circle of radius $r_\zeta$ in the $\zeta$-plane whose interior 
contains exactly one eigenvalue of $P$, $\lambda_0 = 1$. Define 

    $r_z = \min_{\zeta\in\Gamma} \log(1 + \|R_S(\zeta)S\|^{-1}).$ 

Then for every $z$ in the $z$-plane such that $|z| < r_z$, there is exactly one 
eigenvalue of $P(z)$ inside $\Gamma$ [i.e., the eigenvalue $\lambda_0(z)$ of the perturbed matrix 
remains inside $\Gamma$]. 

Moreover, for $\alpha \in (0, 1)$, the eigenvalue function $\lambda_0(z)$ is holomorphic in 
the disc $|z| < (1 - \alpha)r_z$ and its third derivative inside the disc can be 
estimated as follows: 

    $|\lambda_0'''(z)| \le \frac{12\, r_\zeta}{\alpha^3 r_z^3}.$ 

Intuitively, if the resolvent $R_S(\zeta)$ is small in magnitude, then we can be 
sure that for perturbations less than $r_z$, the eigenvalue $\lambda_0(z)$ does not move 
far from $\lambda_0(0)$ and there are no other eigenvalues near $\lambda_0(z)$. The size of $r_z$ 
is inversely related to the size of $R_S(\zeta)$. 

Proof of Lemma 14. Let $D$ be a circle in the $z$-plane with center at 
0 and radius $r_z = \min_{\zeta\in\Gamma} \log(1 + \|R_S(\zeta)S\|^{-1})$. Consider an arbitrary $z_0$ inside $D$. 
We can connect $z = 0$ and $z_0$ by a curve $\Lambda$ that lies completely inside the 
circle $D$. When we change $z$ along this curve, the eigenvalues of the operator 
$P(z)$ follow paths that never intersect the circle $\Gamma$; we know this because 
by Lemma 13, the power series for the resolvent $R(\zeta, z)$ always converges for 
all $\zeta \in \Gamma$. Consequently, the number of eigenvalues of the operator $P(z)$ that 
are located inside $\Gamma$ is conserved along the path $\Lambda$. It follows that $P(z_0)$ has 
exactly one eigenvalue inside $\Gamma$. 

For the second part of the lemma, take an arbitrary $z_0$ such that $|z_0| < 
(1 - \alpha)r_z$. Then exactly one eigenvalue of $P(z_0)$ lies inside $\Gamma$. Consider the 
circle $D_0$ with center at $z_0$ and radius $\alpha r_z$. This circle lies entirely inside the 
circle $D$ and consequently, for any $z \in D_0$, there is only one eigenvalue of 
$P(z)$ inside $\Gamma$. Hence, 

    $|\lambda_0(z) - \lambda_0(z_0)| \le 2r_\zeta.$ 

Recalling that $\lambda_0(z)$ is holomorphic (see [12]), we can estimate its third 
derivative at $z_0$ by using Cauchy's inequality: 

    $|\lambda_0'''(z_0)| \le 6\,\frac{\max_{z\in D_0} |\lambda_0(z) - \lambda_0(z_0)|}{(\alpha r_z)^3} \le 6\,\frac{2r_\zeta}{\alpha^3 r_z^3} = \frac{12\, r_\zeta}{\alpha^3 r_z^3}.$ □ 

Lemma 15. Lei r 6e a circle of radius rp = g/2 around Xq = 1. TTien 

maxp s (C)S||=|. 

Proof. Since S is similar to P, it has the same eigenvalues. Since 5 
is symmetric, Rs(C)S is also symmetric and its norm coincides with the 
largest absolute value of its eigenvalues. Further, Rs(C)S has eigenvalues 
(Aj — C)~ 1 Aj. It is easy to see that if £ £ T, then the maximum is reached for 
i = and Co = 1 — g/2. A calculation gives 

\\Rs((o)S\\ = - g . n 

Lemma 16. Take $\alpha \in (0, 1)$. Then for any $z$ in the disc $|z| < (1 - \alpha)\log(1 +
g/2)$, the following inequality holds:

$$|\lambda_0'''(z)| \le \frac{6g}{\alpha^3} \log^{-3}\left(1 + \frac{g}{2}\right).$$

Proof. From Lemma 14,

$$|\lambda_0'''(z)| \le \frac{12\, r_\zeta}{(\alpha r_z)^3}.$$

Take $r_\zeta = g/2$ and apply Lemma 15 to obtain

$$r_z = \min_{\zeta \in \Gamma} \log(1 + \|R_S(\zeta)S\|^{-1}) = \log\left(1 + \frac{g}{2}\right).$$

Therefore,

$$|\lambda_0'''(z)| \le \frac{6g}{\alpha^3} \log^{-3}\left(1 + \frac{g}{2}\right).$$ □

Combining the previous lemmas, we obtain the following result.

Lemma 17. Take $\alpha \in (0, 1)$. Then for any $z$ in the disc $|z| < (1 - \alpha)\log(1 +
g/2)$, the following inequality holds:



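$$|\lambda_0(z)| \le e^{A|z|^2},$$

where

$$A = \sigma^2\left(\frac{1}{2} + \frac{1}{g}\right) + \frac{1 - \alpha}{\alpha^3}\,\frac{g}{\log^2[1 + g/2]}.$$

Proof. First, using Lemmas 12 and 16, we write

$$|\lambda_0''(z)| \le \sigma^2\left(1 + \frac{2}{g}\right) + \int_0^{|z|} |\lambda_0'''(t)|\,dt \le \sigma^2\left(1 + \frac{2}{g}\right) + \frac{6g}{\alpha^3}\log^{-3}\left(1 + \frac{g}{2}\right)|z|.$$

Then, using Lemma 10, we get

$$|\lambda_0'(z)| \le \int_0^{|z|} |\lambda_0''(t)|\,dt \le \sigma^2\left(1 + \frac{2}{g}\right)|z| + \frac{3g}{\alpha^3}\log^{-3}\left(1 + \frac{g}{2}\right)|z|^2$$

and

$$|\lambda_0(z)| \le 1 + \int_0^{|z|} |\lambda_0'(t)|\,dt \le 1 + \sigma^2\left(\frac{1}{2} + \frac{1}{g}\right)|z|^2 + \frac{g}{\alpha^3}\log^{-3}\left(1 + \frac{g}{2}\right)|z|^3.$$

Using the condition $|z| < (1 - \alpha)\log[1 + g/2]$, we further reduce this to

$$|\lambda_0(z)| \le 1 + \left[\sigma^2\left(\frac{1}{2} + \frac{1}{g}\right) + \frac{1 - \alpha}{\alpha^3}\, g \log^{-2}\left(1 + \frac{g}{2}\right)\right]|z|^2.$$

This inequality and the inequality $1 + x \le e^x$ together imply the claim of
the lemma. □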
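The last reduction in the proof of Lemma 17 can be checked numerically. The sketch below compares the cubic bound $1 + \sigma^2(1/2 + 1/g)t^2 + (g/\alpha^3)\log^{-3}(1 + g/2)\,t^3$ with $e^{At^2}$ on a grid of $t$ in $[0, (1 - \alpha)\log(1 + g/2))$. The function names `rate_A` and `cubic_bound` and the parameter values are hypothetical choices for illustration, not taken from the paper.

```python
import math

def rate_A(sigma2, g, alpha):
    """A = sigma^2 (1/2 + 1/g) + (1 - alpha) g / (alpha^3 log^2(1 + g/2))."""
    L = math.log1p(g / 2.0)
    return sigma2 * (0.5 + 1.0 / g) + (1.0 - alpha) * g / (alpha ** 3 * L ** 2)

def cubic_bound(sigma2, g, alpha, t):
    """1 + sigma^2 (1/2 + 1/g) t^2 + (g / alpha^3) log^{-3}(1 + g/2) t^3."""
    L = math.log1p(g / 2.0)
    return (1.0 + sigma2 * (0.5 + 1.0 / g) * t ** 2
            + g / (alpha ** 3 * L ** 3) * t ** 3)

sigma2, g, alpha = 0.25, 0.4, 0.5  # hypothetical parameter values
t_max = (1.0 - alpha) * math.log1p(g / 2.0)
# Grid of t in [0, t_max); the endpoint itself is excluded, as in the lemma.
ok = all(
    cubic_bound(sigma2, g, alpha, t) <= math.exp(rate_A(sigma2, g, alpha) * t * t)
    for t in (t_max * j / 1000.0 for j in range(1000))
)
print(ok)
```

The check succeeds because, on this range, one factor of $t$ in the cubic term is bounded by $(1 - \alpha)\log(1 + g/2)$, which turns it into the second term of $A$ times $t^2$.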



We should now treat the case when $z$ is real and greater than
$(1 - \alpha)\log(1 + g/2)$.



Lemma 18. For every real $z > 0$,

$$|\lambda_0(z)| \le e^{z}.$$

Proof. Recall (from Lemma 7) that $P(z)$ has the same eigenvalues as
$S(z)$, where $S(z) = E_{z/2} S E_{z/2}$, $E_{z/2} = \mathrm{diag}\,\exp(\frac{z}{2}(f(x), u))$ and $u$ is a vector
of unit length. It follows that the absolute value of the largest eigenvalue
does not exceed $\|S(z)\| \le \|E_{z/2}\|^2 \|S\| \le e^{z}$, where we used the assumption
that $|f(x)| \le 1$ to bound $\|E_{z/2}\|$ and the fact that $\|S\| = 1$. □
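The bound of Lemma 18 can be illustrated on a two-state chain, where everything is available in closed form. The sketch below assumes the usual symmetrization $S = D^{1/2} P D^{-1/2}$ with $D$ the diagonal matrix of stationary probabilities, and writes `h[i]` for the scalar $(f(i), u)$; the function name `top_eig_tilted` and the numerical values are hypothetical.

```python
import math

def top_eig_tilted(a, b, h, z):
    """Largest eigenvalue of S(z) = E S E for a two-state chain.

    P = [[1-a, a], [b, 1-b]] is reversible, with symmetrization
    S = [[1-a, sqrt(ab)], [sqrt(ab), 1-b]], and E = diag(exp(z * h_i / 2)),
    where h_i stands for (f(i), u) and is assumed to satisfy |h_i| <= 1.
    """
    e0 = math.exp(z * h[0] / 2.0)
    e1 = math.exp(z * h[1] / 2.0)
    p = (1.0 - a) * e0 * e0
    q = math.sqrt(a * b) * e0 * e1
    r = (1.0 - b) * e1 * e1
    # Largest eigenvalue of the symmetric 2x2 matrix [[p, q], [q, r]].
    return (p + r) / 2.0 + math.hypot((p - r) / 2.0, q)

a, b = 0.3, 0.2      # hypothetical transition probabilities
h = (1.0, -0.5)      # hypothetical values of (f(i), u)
for z in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(z, top_eig_tilted(a, b, h, z), math.exp(z))
```

At $z = 0$ the function returns the unperturbed eigenvalue 1, and for $z > 0$ the printed eigenvalue stays below $e^z$ in this example.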



Lemma 19. For every real $z > 0$,

(16) $|\lambda_0(z)| \le e^{k|z|^2},$

where

(17) $k = \sigma^2\left(\frac{1}{2} + \frac{1}{g}\right) + \frac{192}{125}\,\frac{g}{\log^2[1 + g/2]}.$

Proof. Take $\alpha = 5/8$. Then by Lemma 17, inequality (16) with rate
(17) holds for $|z| < (3/8)\log(1 + g/2)$. However, for $|z| \ge (3/8)\log(1 + g/2)$,
we have

$$k|z|^2 \ge \frac{192}{125}\,\frac{g}{\log^2[1 + g/2]} \cdot \frac{3}{8}\log(1 + g/2)\,|z| = \frac{72}{125}\,\frac{g}{\log[1 + g/2]}\,|z| \ge |z|,$$

where the last step uses $\log(1 + g/2) \le g/2$,
and using Lemma 18, we conclude that inequality (16) with rate (17) is valid
for all real $z > 0$. □
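The arithmetic behind the constants $192/125$ and $72/125$, and the fact that the quadratic rate dominates the linear bound of Lemma 18 beyond the threshold, can be verified with a short script. This is only a sketch: the function names `rate_k` and `covers_linear_bound` and the grid of test points are assumptions made for the example.

```python
import math
from fractions import Fraction

alpha = Fraction(5, 8)
coeff = (1 - alpha) / alpha ** 3       # coefficient of g / log^2(1 + g/2) in k
threshold_coeff = coeff * (1 - alpha)  # coefficient of g / log(1 + g/2) at the threshold
print(coeff, threshold_coeff)

def rate_k(sigma2, g):
    """k = sigma^2 (1/2 + 1/g) + (192/125) g / log^2(1 + g/2)."""
    L = math.log1p(g / 2.0)
    return sigma2 * (0.5 + 1.0 / g) + float(coeff) * g / L ** 2

def covers_linear_bound(sigma2, g, n=200):
    """Check k z^2 >= z on a grid of points z >= (3/8) log(1 + g/2)."""
    z0 = 0.375 * math.log1p(g / 2.0)
    k = rate_k(sigma2, g)
    return all(k * z * z >= z for z in (z0 * (1.0 + j / 10.0) for j in range(n)))

print(all(covers_linear_bound(0.0, g) for g in (0.01, 0.5, 1.0)))
```

Exact rational arithmetic is used for the two coefficients so that the comparison with $192/125$ and $72/125$ involves no rounding.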



The claim of Proposition 9 follows if we take $z = |v|$ and $u = v/|v|$ in
Lemma 17. As was shown in the previous section, the validity of the inequality
in Proposition 9 implies the validity of Theorem 1.

4. Concluding remarks. We have derived an inequality for the probabil- 
ity of large deviations of vector-valued functions on a finite Markov chain. 
The results can be extended in two directions. First, it is desirable to elimi- 
nate dependence on the dimension in the term before the exponential. Cor- 
responding results for i.i.d. and martingale variables suggest that this is 
possible. Second, it would be desirable to extend the results to denumerable 
Markov chains and, in particular, to random walks on denumerable groups.



Acknowledgment. I would like to thank Diana Bloom for her editorial 
help. 






REFERENCES 

[1] Aldous, D. and Fill, J. (2006). Reversible Markov chains and random
walks on graphs. Monograph in preparation. Available at
http://128.32.135.2/users/aldous/RWG/book.html.

[2] Bennett, G. (1962). Probability inequalities for the sum of independent random 
variables. J. Amer. Statist. Assoc. 57 33-45. 

[3] Bernstein, S. (1952). Collected Papers. Izdat. Acad. Nauk SSSR, Moscow. (In Rus- 
sian.) MR0048360 

[4] Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis 
based on the sum of observations. Ann. Math. Statist. 23 493-507. MR0057518 

[5] Diaconis, P. (1988). Group Representations in Probability and Statistics. IMS,
Hayward, CA. MR0964069

[6] Dinwoodie, I. H. (1995). A probability inequality for the occupation measure of a 
reversible Markov chain. Ann. Appl. Probab. 5 37-43. MR1325039 

[7] Donsker, M. D. and Varadhan, S. R. S. (1975). Asymptotic evaluation of certain 
Markov process expectations for large time. I, II. Comm. Pure Appl. Math. 28 
1-47, 279-301. MR0386024 

[8] Gangolli, A. R. (1991). Convergence bounds for Markov chains and applications 
to sampling. Ph.D. dissertation, Stanford Univ. 

[9] Gillman, D. (1993). Hidden Markov chains: Convergence rates and the complexity
of inference. Ph.D. thesis, MIT.

[10] Hoeffding, W. (1963). Probability inequalities for sums of bounded random
variables. J. Amer. Statist. Assoc. 58 13-30. MR0144363

[11] Kallenberg, O. and Sztencel, R. (1991). Some dimension-free features of
vector-valued martingales. Probab. Theory Related Fields 88 215-247. MR1096481

[12] Kato, T. (1980). Perturbation Theory for Linear Operators. Springer, Berlin.
MR0407617

[13] Kolmogoroff, A. (1929). Über das Gesetz des iterierten Logarithmus. Math. Ann.
101 126-135. MR1512520

[14] Lezaud, P. (1998). Chernoff-type bound for finite Markov chains. Ann. Appl. Probab.
8 849-867. MR1627795

[15] Miller, H. D. (1961). A convexity property in the theory of random variables defined
on a finite Markov chain. Ann. Math. Statist. 32 1260-1270. MR0126886

[16] Nagaev, S. V. (1957). Some limit theorems for stationary Markov chains. Theory
Probab. Appl. 2 378-406. MR0094846

[17] Prokhorov, Y. V. (1959). An extremal problem in probability theory. Theory
Probab. Appl. 4 201-203. MR0121857

[18] Prokhorov, Y. V. (1968). An extension of S. N. Bernstein's inequalities to
multidimensional distributions. Theory Probab. Appl. 13 260-267. MR0230353

[19] Saloff-Coste, L. (2004). Random walks on finite groups. In Probability on Discrete
Structures. Encyclopaedia of Mathematical Sciences (H. Kesten, ed.) 100 263-346.
Springer, Berlin. MR2023654

[20] Yurinskii, V. V. (1970). On an infinite-dimensional version of S. N. Bernstein's
inequalities. Theory Probab. Appl. 15 108-109. MR0268941






Courant Institute 

of Mathematical Sciences 
109-20 71st Road 
Apt. 4A 

Forest Hills, New York 
USA 

E-MAIL: kargin@cims.nyu.edu 



